feat(cli): cdk rollback #31407

Merged
merged 14 commits into from
Oct 2, 2024
Conversation

rix0rrr
Contributor

@rix0rrr rix0rrr commented Sep 11, 2024

Add a CLI feature to roll a stuck change back.

This is mostly useful for deployments performed using --no-rollback: if a failure occurs, the stack gets stuck in an UPDATE_FAILED state from which there are 2 options:

  • Try again using a new template
  • Roll back to the last stable state

There used to be no way to perform the second operation using the CDK CLI, but there now is.

cdk rollback works in 2 situations:

  • A paused fail state; it will initiate a fresh rollback (on CREATE_FAILED, UPDATE_FAILED).
  • A paused rollback state; it will retry the rollback, optionally skipping some resources (on UPDATE_ROLLBACK_FAILED -- it seems there is no way to continue a rollback in ROLLBACK_FAILED state).
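
The two situations above can be read as a dispatch on the stack status. The following is a minimal sketch of that reading, not the code from this PR: the type `RollbackAction` and the function `chooseRollbackAction` are hypothetical names, and the only facts taken from the description are which statuses map to a fresh rollback, which to a continued rollback, and that ROLLBACK_FAILED cannot be continued.

```typescript
// Hypothetical types and names, for illustration only; not the CDK CLI's implementation.
type RollbackAction =
  | { kind: 'start' } // would map to the CloudFormation RollbackStack API
  | { kind: 'continue'; resourcesToSkip: string[] } // would map to ContinueUpdateRollback
  | { kind: 'unsupported'; reason: string };

function chooseRollbackAction(stackStatus: string, orphanLogicalIds: string[] = []): RollbackAction {
  switch (stackStatus) {
    // Paused fail state: initiate a fresh rollback.
    case 'CREATE_FAILED':
    case 'UPDATE_FAILED':
      return { kind: 'start' };

    // Paused rollback state: retry, optionally skipping the given resources.
    case 'UPDATE_ROLLBACK_FAILED':
      return { kind: 'continue', resourcesToSkip: orphanLogicalIds.slice() };

    // Per the description, there appears to be no way to continue from here.
    case 'ROLLBACK_FAILED':
      return { kind: 'unsupported', reason: 'A rollback in ROLLBACK_FAILED state cannot be continued' };

    // Assumption: any other status is treated as "nothing to roll back".
    default:
      return { kind: 'unsupported', reason: `Stack status ${stackStatus} is not a paused failure state` };
  }
}
```

For example, `chooseRollbackAction('UPDATE_ROLLBACK_FAILED', ['MyBucket'])` yields a `continue` action that skips the (made-up) logical ID `MyBucket`.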

cdk rollback --orphan <logicalid> can be used to skip resource rollbacks that are causing problems.

cdk rollback --force will look up all failed resources and continue skipping them until the rollback has finished.
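
One way to picture the `--force` behaviour is as a retry loop that grows a skip list. The sketch below is an assumption-laden illustration rather than the PR's implementation: `RollbackAttempt`, `ContinueRollback`, `forceRollback`, and the `maxAttempts` guard are all hypothetical, and the injected `continueRollback` callback stands in for whatever actually calls CloudFormation and looks up the failed resources.

```typescript
// Hypothetical shapes; not real CDK or AWS SDK types.
interface RollbackAttempt {
  status: string; // stack status once the attempt has settled
  failedLogicalIds: string[]; // resources whose rollback failed during this attempt
}
type ContinueRollback = (resourcesToSkip: string[]) => Promise<RollbackAttempt>;

// Keep retrying the rollback, skipping every resource seen to fail so far,
// until the stack leaves UPDATE_ROLLBACK_FAILED. Returns the skipped logical IDs.
async function forceRollback(continueRollback: ContinueRollback, maxAttempts = 10): Promise<string[]> {
  const skipped = new Set<string>();
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await continueRollback(Array.from(skipped));
    if (result.status !== 'UPDATE_ROLLBACK_FAILED') {
      return Array.from(skipped); // rollback is no longer stuck
    }
    const before = skipped.size;
    result.failedLogicalIds.forEach((id) => skipped.add(id));
    if (skipped.size === before) {
      // No new resource to skip, so another retry would fail the same way.
      throw new Error('Rollback is still failing but no new failed resources were found');
    }
  }
  throw new Error(`Rollback did not finish after ${maxAttempts} attempts`);
}
```

The attempt limit and the "no progress" check are defensive choices made for this sketch so that the loop always terminates; the description itself only says the command continues skipping until the rollback has finished.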

This change requires new bootstrap permissions, so the bootstrap stack is updated to add the following IAM permissions to the deploy-action role:

                  - cloudformation:RollbackStack
                  - cloudformation:ContinueUpdateRollback

These are necessary to call the 2 CloudFormation APIs that start and continue a rollback.

Relates to (but does not close yet) #30546.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@rix0rrr rix0rrr requested a review from a team September 11, 2024 14:57
@aws-cdk-automation aws-cdk-automation requested a review from a team September 11, 2024 14:58
@github-actions github-actions bot added the p2 label Sep 11, 2024
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Sep 11, 2024
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

The pull request linter has failed. See the aws-cdk-automation comment below for failure reasons. If you believe this pull request should receive an exemption, please comment and provide a justification.

A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed add Clarification Request to a comment.

@aws-cdk-automation aws-cdk-automation added the pr/needs-cli-test-run This PR needs CLI tests run against it. label Sep 11, 2024
@rix0rrr rix0rrr added the pr-linter/exempt-integ-test The PR linter will not require integ test changes label Sep 11, 2024
@rix0rrr rix0rrr added pr/do-not-merge This PR should not be merged at this time. pr-linter/cli-integ-tested Assert that any CLI changes have been integ tested labels Sep 12, 2024
@aws-cdk-automation aws-cdk-automation dismissed their stale review September 12, 2024 14:19

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@aws-cdk-automation aws-cdk-automation added pr/needs-maintainer-review This PR needs a review from a Core Team Member and removed pr/needs-cli-test-run This PR needs CLI tests run against it. labels Sep 12, 2024
@rix0rrr rix0rrr self-assigned this Sep 25, 2024
packages/aws-cdk/README.md (resolved)
packages/aws-cdk/README.md (outdated, resolved)
print('\n✨ Rollback time: %ss\n', formatTime(elapsedRollbackTime));
} catch (e: any) {
error('\n ❌ %s failed: %s', chalk.bold(stack.displayName), e.message);
throw new Error('Rollback failed (use --force to orphan failing resources)');
Contributor

Would be better to accumulate errors to avoid a poison pill.

Contributor Author

What do you mean?

Contributor

Oh so I think I was commenting from the position that if a stack cannot be rolled back, we should throw an error - in this case a single faulty stack can prevent rolling back others, so we need to accumulate errors.

If you insist on no-oping for an unrollable stack, this is fine, but I still think we should error out.

Contributor

So I see that right now you do throw an error on ROLLBACK_FAILED state. Doesn't this mean you need to swallow the error here and proceed, to avoid the poison pill?
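
To make the "accumulate errors" suggestion in this thread concrete, here is a minimal sketch of what it could look like. It is an illustration of the reviewer's idea under stated assumptions, not code from the PR: `rollbackAll` and the injected `rollbackOne` callback are hypothetical names, and the wording of the final error message is invented.

```typescript
// Hypothetical helper; `rollbackOne` stands in for the per-stack rollback call.
async function rollbackAll(
  stackNames: string[],
  rollbackOne: (stackName: string) => Promise<void>,
): Promise<void> {
  const failures: Array<{ stackName: string; message: string }> = [];

  for (const stackName of stackNames) {
    try {
      await rollbackOne(stackName);
    } catch (e) {
      // Record the failure and keep going, so one faulty stack
      // does not prevent the remaining stacks from being rolled back.
      failures.push({ stackName, message: e instanceof Error ? e.message : String(e) });
    }
  }

  // Still error out at the end, so the command as a whole reports failure.
  if (failures.length > 0) {
    const details = failures.map((f) => `${f.stackName}: ${f.message}`).join('; ');
    throw new Error(`Rollback failed for ${failures.length} stack(s): ${details}`);
  }
}
```

With this shape, every stack is attempted once, and the caller still sees a single error that names each stack that could not be rolled back.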

packages/aws-cdk/lib/cli.ts (outdated, resolved)
packages/aws-cdk/lib/cli.ts (resolved)
* It contains resources r1 and r2, where r1 gets deployed first.
*
* - PHASE = 1: both resources deploy regularly.
* - PHASE = 2: r1 gets updated, r2 will fail to update, and r1 will fail its rollback.
Contributor

Needs to be adjusted for phases 2a and 2b. I'm good with just removing this and letting the tests speak for themselves.

packages/aws-cdk/lib/api/deployments.ts (resolved)
@aws-cdk-automation
Collaborator

➡️ PR build request submitted to test-main-pipeline ⬅️

A maintainer must now check the pipeline and add the pr-linter/cli-integ-tested label once the pipeline succeeds.

@aws-cdk-automation aws-cdk-automation removed the pr/needs-maintainer-review This PR needs a review from a Core Team Member label Oct 2, 2024
@rix0rrr rix0rrr removed the pr/do-not-merge This PR should not be merged at this time. label Oct 2, 2024
Contributor

mergify bot commented Oct 2, 2024

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 9fcab7c
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit 0755561 into main Oct 2, 2024
11 of 12 checks passed
@mergify mergify bot deleted the huijbers/cli-rollback branch October 2, 2024 12:16

github-actions bot commented Oct 2, 2024

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 2, 2024
Labels
contribution/core This is a PR that came from AWS. p2 pr-linter/cli-integ-tested Assert that any CLI changes have been integ tested pr-linter/exempt-integ-test The PR linter will not require integ test changes
3 participants